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Abstract 

We present some new density estimation algorithms obtained by bootstrap ag- 
gregation like Bagging. Our algorithms are analyzed and empirically compared 
to other methods found in the statistical literature, like stacking and boosting for 
density estimation. We show by extensive simulations that ensemble learning are 
effective for density estimation like for classification. Although our algorithms do 
not always outperform other methods, some of them are as simple as bagging, more 
intuitive and has computational lower cost. 

Keywords: Machine Learning, Histogram, Kernel Density Estimator, Bootstrap, 
Bagging, Boosting, Stacking. 

1 Introduction 

Ensemble learning is one of the most challenging recent approaches in statistical learn- 
ing. Bagging (pjj, Boosting (|5J), Stacking ([2]), and Random forests ([3]) have been 
declared to be the best of the chelf classifiers achieving very high performances when 
tested over tens of various datasets selected from the machine learning benchmark. All 
these algorithms had been designed for supervised learning, sometimes initially restricted 
to regression or binary classification. Several extensions are still under study: multivari- 
ate regression, multiclass learning, and adaptation to functional data or time series. 
Very few developments exist for ensemble learning for unsupervised framework, clustering 
analysis and density estimation. Our work concerns the latter case which may be seen 
as a fundamental problem in statistics. Among the last developments, we found some 
extensions of boosting ([5]) and stacking ([ill]) to density estimation. 
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In this paper we suggest some simple algorithms for density estimation in the same spirit 
of bagging and stacking where the weak learners are histograms. We show by extensive 
simulations that aggregation gives rise to effective better estimates. We compare our 
algorithms to several algorithms for density estimation, some of them are simple like His- 
togram and kernel density estimators (KDE) and others rather complex like stacking and 
boosting which will be described in details. As we will show in the experiments we do, 
although the accuracy of our algorithm is not systematically higher than other ensemble 
methods, it is with no doubt simpler, more intuitive and computationally less expen- 
sive. Boosting algorithms and stacking for density estimation are described in section 2. 
Section 3 describes our algorithms. Simulations and results are given in section 4 and 
concluding remarks and future work in section 5. 

2 A review of the existing algorithms 

In this section we review some density estimators obtained by aggregation. They may be 
classified in two categories depending on the aggregation form. 
The first type has the form of linear or convex combination: 

At 

/mO) = ^ a ™9m(x) (1) 
m=l 

where g m is typically a parametric or non parametric density model, and in general 
different values of m refer typically to 

• different parameters values in the parametric case or, 

• different kernels, or 

• different bandwidths for a chosen kernel for the kernel density estimators. 

The second type of aggregation is multiplicative and is based on the ideas of high order 
bias reduction for kernel density estimation ([§]). The aggregated density estimator has 
the form: 

At 

/mO) = Y[ a m9m(x) (2) 
m=l 

2.1 Linear or convex combination of density estimators 

This kind of estimators ([I]) has been used in several works with different construction 
schemes. 
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In [JO], [5] and [J2] the weak learners g m are introduced sequentially in the combi- 
nation. At step m, g m is chosen to maximize the log likelihood of 

f m (x) = (1 - a)/ m _i(x) + a# m (x) (3) 

where <7 m is a density selected among a fixed class H. 

In [10] g m is selected among a non parametric family of estimators, and in [8] and 
[12J, it is taken to be a Gaussian density or a mixture of Gaussian densities whose 
parameters are estimated. Different methods are used to estimate both density g m 
and the mixture coefficient a. In [S] g m is a Gaussian density and the log likelihood 
of ([3]) is maximized using a special version of Expectation Maximization (EM) taking 
into account that a part of the mixture is known. 

The main idea underlying the algorithms given by [10] and [12] is to use Taylor 
expansion around the negative log likelihood that we wish to minimize: 

E - logUmfc)) = E - m/-i(^)) - « E + °(« 2 ) 

For a small we have the approximation 



E -iog( f m (xi)) ~ E -1 °g(^ i ( Xi )) _a E 



fm—l 



thus, minimizing the left side term is equivalent to maximizing £\ j , ,. , ■ 

All the algorithms described above are sequential and the number of weak learners 
aggregated may be fixed by the user. 

[TTj use stacked density estimator applying the same aggregation scheme as in 
stacked regression and classification ([13]). The M densities g m are fixed in ad- 
vance (KDE with different bandwidths). The data set C = {x\, . . . , x n } is divided 
into V cross validation subsets Ci, . . . , £y- For v = 1, .., V, denote ES~ V ^ = C — C v . 
The M models g 1 , . . . , g M are fitted using the training samples . . . , £(~ v \ the 

obtained estimates are denoted ~gm l \ • • • , dm V ^ for all m = 1, . . . , M. These models 
are then evaluated over the test sets Ci, . . . , Cy, getting the vectors gin V \C V ) for 
m — 1, . . . , M, v — 1, . . . , V put within a n x M matrix 



.4 



/ ^(A) ^7 1} (A) \ 

^ 2) (^ 2 ) g { M 2 \c 2 ) 

\gt v \c v ) 9 { m V) (£v) ) 
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This matrix is used to compute the coefficients aci, . . . , a at using the Expectation- 
Maximization algorithm. Finally, for the output model, we re-estimate the individ- 
ual densities gi, . . . , gu from the whole data C. 

• In [H] the densities {g m } m =i t ...,M are fixed in advance like for stacking (KDE estima- 
tors with different bandwidths). The dataset is split in two parts. The first sample 
is used to estimate the densities g m , whereas the coefficients a m are optimized using 
the second sample. The splitting process is repeated and the aggregated estimators 
for each data split are averaged. The final model has the form 

where S is the set of all the splits used and 

M 
m=l 

is the aggregated estimator obtained from one split s of the data, g s m is the individual 
kernel density function estimated over the learning sample obtained from the split 
s. This algorithm is called AggPure. 



2.2 Multiplicative aggregation 

The only algorithm giving rise to this form of aggregation is the one described in [4j called 
BoostKde. It is a sequential algorith where at each step m the weak learner is computed 
as follows: 

i=l v 

where K is a fixed kernel, h its bandwidth, and w m (i) the weight of observation i at step 
m. Like for boosting, the weight of each observation is updated: 



w m+ i{i) = w m (i) +log 



gm \Xi 



(-%) n w C) x -x " M 

where g m l \xi) = ]T ^A2lj{ (5*^3). The output is given by /a/0) = C f] 9m{x) 

j=l,j^i ' m=l 

where C is a normalization constant. The Algorithm is resumed in figure [TJ 



4 



1. For i = 1, . . . , n, initialize the weigths of the observations w\(i) = — and fix the 
bandwidth h. 

2. For m = 1 to M: 

(a) Compute the weighted kernel estimate 



w m (i) fx - Xi\ 

) = h— K {—) 



i=i 



(b) Update the weights w m+ %(i) = w m (i) + log I ^ 



9m(3)i) 



9m 



3 ±* 



withg^'(x t )= £ , 

M 

3. Output: /m(x) = C II 9m{x) (C normalization constant such that Jm inte- 

m=l 

grates to unity). 



Figure 1: Boosting kernel density estimation algorithm ((BoostKde), [4]) 



3 Aggregating Histograms 

We suggest three new density estimators obtained by linear combination like in (GQ), all of 
them use histograms as weak learners. The first two algorithms may be parallelized and 
randomize the histograms. The third one is just a modification of Stacking. 

The first algorithm is similar to Bagging ([!]). At each step m, a bootstrap sample 
of the original dataset is generated and g m is the histogram obtained from the generated 
bootstrap sample with a fixed number of equally spaced breakpoints. We will refer to this 
algorithm as BagHist, it is detailed in figure |2j 

The second algorithm (AggregHist) works as follows: let go be the histogram obtained 
with the data set at hand using equally spaced breakpoints denoted B — b\, & 2 , Each 
weak learner g m is an histogram constructed over the same initial data set but using a 
randomly modified set of breakpoints; 7 is a tuning parameter which controls the variance 
of the perturbations. The algorithm is detailed in figure [3j 

Finally, we introduce a third algorithm called StackHist where we replace in the stack- 
ing algorithm described in the previous section, the six kernel density estimators by three 
histograms with fixed number of breaks. 

The values of the parameters used in these algorithms will be optimized. The proce- 
dure used will be described in the experiments section. 
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1. Let C be the original sample 

2. For m = 1 to M: 

(a) Let C m be a bootstrap sample of C 

(b) Set g m to be the histogram constructed over C m with equispaced L break- 
points. 

3. Output: f M (x) = jjYa 9m(x) 



Figure 2: Bagging of histograms (BagHist) 



1. Let £ be the original sample, go be the histogram constructed over C and 
B = {&i, 62, 6l} the set of the ordered optimized breakpoints. 

2. For m = 1 to M: 

(a) Set B m = &( 2 )' ^(L)} m °dihed breakpoints obtained by setting 
6 Z * = k + ei where £/ ~ iV(0, a) and a = 7 mmi</<i, {6; - 

(b) Set <7 m to be the histogram constructed over £ using the breakpoints B m . 

3. Output: /Af(ar) = jiYa 9m{x) 



Figure 3: Aggregating histograms based on randomly perturbed breakpoints (AggregHist) 

4 Experiments 

We test several simulation models based on classical distributions and mixture models 
mostly used in the cited works and algorithms described above. The sample size is fixed 
at n= 100,500,1000. 

We first show the estimates obtained using BagHist and AggregHist and analyze the 
effect of the number M of histograms aggregated, and then compare them to the other 
algorithms. 

Up to our knowledge the existing algorithms for density estimation by aggregation 
have never been compared over a common benchmark simulation data. 

4.1 Models used for the simulations 

We denote by Ml, . . . , A411 the different simulation models which will be grouped ac- 
cording to their difficulty level. 
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Some standard densities used in [I]: 
(Ml): standard Gaussian density N(0, 1) 



x < 
e~ x x > 



(M2): standard exponential density f(x) = 

(M3): a Chisquare density Xio 
(M4): a Student density £ 4 

Some Gaussian mixtures taken from [I] and |llj : 

(Mb): 0.5iV(-l,0.3) + 0.5iV(l,0.3) 

(M6): 0.5iV(-2.5,l) + 0.5JV(2.5,l) 

(Ml): 0.25iV(-3, 0.5) + 0.5iV(0, 1) + 0.25iV(3, 0.5) 

Gaussian mixtures used in from [S] and taken from [7] 

4 

(M8): the Claw density, 0.5N(0, 1) + £ TqN (f - 1, i) 



(M9): the Smooth Comb Density £ ir N 



i=0 

' 65 -96i fill 



j=0 



2 S ~' Ar / u - J - au ? V63, 
21 



32 
63 



N (-**) + ™ N (E,}*) + ± N (1±, L) + ± N (*,±) + L N (« t 2.) + ± N (* t 1) 

V 21 63/ 63 \21 63/ 63 \21 63/ 63 \21 63/ 63 \21 63/ 63 \21 63/ 



Mixtures density with highly inhomogeneous smoothness as in [9]: 
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(M10): 0.5iV(0, 1) + 0.5 £ l ^-p ^ 

= 1 I T ' T J 



1= 

14 



(Mil): 0.5iV(0, 1) + 0.5 £ l/ap-p ^ 

8=1 I T ' T J 

All the simulations are done with the R software, and for models A48 and M9 we use 
the benchden package. 

We show below in figures HI |5] and EJ the true densities for the eleven models as well 
as their estimates obtained using the three algorithms AggregHist, BagHist and StackHist 
for n = 500 observations and M = 300 histograms for the two first algorithms. 

AggregHist and BagHist give more smooth estimators than StackHist. 
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Model 1 Model 2 




Figure 4: Densities used in simulation models 1 to 4 together with the corresponding 
histogram and the estimators given by AggregHist, BagHist and StackHist. 
















Model 7 







Figure 5: Densities used in simulation models 5 to 7 together with the corresponding 
histogram and the estimators given by AggregHist, BagHist and StackHist. 
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Model 8 Model 9 




Figure 6: Densities used in simulation models 8 to 11 together with the corresponding 
histogram and the estimators given by AggregHist, BagHist and StackHist. 

4.2 Tuning the algorithms 

For the existing algorithms we have used the same values suggested by their corresponding 
authors : 

• For Stacking, six kernel density estimators are aggregated, three of them use Gaus- 
sian kernels with fixed bandwidths h = 0.1,0.2,0.3 and three triangular kernels 
with bandwidths h = 0.1,0.2,0.3. The number of cross validation samples is fixed 
to V = 10. 

• For AggPure six kernel density estimators are aggregated having bandwidths 0.001, 
0.005,0.01,0.05,0.1 and 0.5. We use the EM algorithm to optimize the coefficients 
of the linear combination. The final estimator is a mean over S = 10 random splits 
of the original data set. 

• For Boostkde, we use 5 steps for the algorithm aggregating kernel density estimators 
whose bandwidths are optimized using Silverman rule. Normalization of the output 
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is done using numerical integration. Extensive simulations showed that more steps 
give rise to less accurate estimators. 

Simple one kernel density estimators are also used in our comparisons using optimized 
bandwidths following Silverman rule (KdeNrdO) and unbiased cross validation (KdeUCV). 
For the Histogram, fixed breaks are systematically used and their number is optimized 
over a fixed grid, retaining the one which maximizes the log likelihood of the obtained 
histogram over a test sample drawn from the same distribution as the learning sample. 
The tuning parameters for our algorithms, the number of breakpoints and the value of 7 
are optimized testing different values for each of them over a fixed grid. We test 10, 20 and 
50 equally spaced breakpoints for each case. For 7 we chose the grid 0.5, 1, 1.5, 2, 2.5. The 
best combination retained for each model is the one which maximizes the log likelihood 
over 100 independent test samples drawn from the corresponding model. For BagHist 
and AggregHist we aggregate M = 300 histograms. The optimal values for the his- 
togram and for our algorithms are give in table [H We denote the optimal number of 
breaks L^, Lbh, Lah for the Histogram, BagHist and AggregHist respectively, and jah 
the perturbation coefficient for AggregHist. 







n = 


100 




n = 500 




n = 


1000 






L H 


Lah 


Lbh 


1AH 


L H 


Lah L 


BH 


1AH 


L H 


Lah 


Lbh 


1AH 


Ml 


50 


10 


50 


1.0 


50 


10 


10 


0.5 


50 


20 


20 


0.5 


M2 


50 


10 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


0.5 


M3 


50 


10 


50 


0.5 


50 


10 


50 


0.5 


50 


20 


50 


0.5 


MA 


50 


20 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


0.5 


M5 


50 


20 


50 


1.0 


50 


50 


50 


2.0 


50 


50 


50 


1.0 


M6 


50 


10 


50 


1.0 


50 


20 


20 


1.0 


50 


50 


20 


2.0 


M7 


50 


20 


10 


0.5 


20 


20 


20 


0.5 


20 


50 


20 


0.5 


M8 


50 


20 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


2.0 


M9 


50 


20 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


1.0 


M10 


50 


50 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


0.5 


Mil 


50 


20 


50 


0.5 


50 


50 


50 


0.5 


50 


50 


50 


0.5 



Table 1: Optimal parameters values used for our algorithms. 



Finally, for StackHist we aggregate six histograms having 5,10,20,30,40 and 50 
equally spaced breakpoints. A ten fold cross validation is used. 

4.3 Results 

The performance of each model is evaluated using the Mean Integrated Squared Error 
(MISE). It is estimated as the average of the integrated squared error over 100 simula- 
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tions. First, for both AggregHist and BagHist we analyze the effect of the number M of 
histograms aggregated. Figures [3> [S] and M show how the MISE varies when increasing the 
number of histograms. These graphics show clearly the contribution of the aggregation 
to the reduction of the MISE. For all the models, the error does not decrease significantly 
after about 100 iterations. 



Model 1 Model 2 




1 2 5 10 20 50 100 1 2 5 10 20 50 100 



Model 3 Model 4 




1 2 5 10 20 50 100 1 2 5 10 20 50 100 

M M 



Figure 7: MISE error versus number of aggregated histograms for Models 1 to 4. 
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Model S 



AggHist 




Model 7 



Figure 8: MISE error versus number of aggregated histograms for Models 5 to 7. 



2 5 10 20 50 100 



2 5 10 20 50 100 




2 5 10 20 50 100 



2 5 10 20 50 100 



Figure 9: MISE error versus number of aggregated histograms for Models 8 to 11. 
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We compare now our algorithms BagHist, AggregHist and StachHist to the following 
methods: the Histogram, KdeNrdO, KdeUCV, Stacking, AggPure and BoostKde. We have 
limited the comparisons to the ensemble methods which aggregated non parametric den- 
sity estimators. 

Tables 1 to 3 give the values of 100 x MISE for each method and simulation model 
for the three values of n. The best performances are indicated in bold. 
In most cases, aggregation models have higher accuracy than simple methods like the his- 
togram and KDE. However this is not the case for model .M3: KDE has a better perfor- 
mance, that is probably due to border effects. BoostKde gives in general better estimates 
for mixture models. All the methods have better accuracy in general when increasing the 
sample size n. BagHist and AggregHist always outperform the optimal Histogram and in 
general both have better accuracy than StackHist. For n = 100 BagHist and AggregHist 
outperform the other algorithms over the complicated models (Ai8 — AilO). For n = 500 
and n = 1000, AggPure outperforms the other methods for the last two models. Although 
our algorithms do not always outperform the other methods, their precision are not far 
from the best ones. 



Table 2: n = 100, M = 300, 100 x MISE 



100 


Histogram 


KdeNrdO 


KdeUCV 


BoostKde 


AggPure 


Stacking 


StackHist 


BagHist 


AggregHist 


Ml 


2.870 


0.193 


0.221 


0.831 


0.365 


0.233 


0.511 


2.250 


0.209 


M2 


4.820 


7.040 


4.710 


9.320 


2.750 


2.610 


1.580 


4.330 


1.560 


M3 


0.162 


0.012 


0.013 


0.035 


0.106 


0.061 


0.032 


0.127 


0.015 


M4 


1.380 


0.181 


0.219 


1.460 


0.318 


0.202 


0.370 


1.130 


0.220 


M5 


6.700 


6.410 


3.070 


0.387 


1.040 


0.856 


1.940 


5.040 


0.911 


M6 


0.739 


0.149 


0.082 


0.056 


0.189 


0.111 


0.178 


0.550 


0.080 


Ml 


0.820 


0.278 


0.156 


0.107 


0.163 


0.110 


0.168 


0.126 


0.128 


M8 


4.310 


2.280 


2.130 


3.840 


1.910 


1.620 


1.850 


3.390 


1.560 


M9 


2.460 


2.330 


1.350 


1.350 


0.967 


0.927 


1.130 


1.950 


0.832 


MIO 


5.340 


6.750 


6.940 


6.090 


4.480 


4.700 


4.780 


3.640 


3.720 


Mil 


6.130 


6.620 


6.840 


5.590 


5.160 


5.650 


5.590 


4.940 


5.130 
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Table 3: n = 500, M = 300, lOOx MISE 



500 


Histogram 


KdeNrdO 


KdcUCV 


BoostKde 


AggPure 


Stacking 


StackHist 


BagHist 


AggregHist 


Ml 


0.474 


0.085 


0.100 


0.098 


0.076 


0.054 


0.154 


0.085 


0.056 


M2 


0.786 


5.980 


2.840 


6.930 


1.280 


1.120 


0.545 


0.701 


0.720 


M3 


0.024 


0.003 


0.003 


0.029 


0.020 


0.012 


0.009 


0.018 


0.004 


MA 


0.180 


0.099 


0.113 


1.160 


0.059 


0.043 


0.114 


0.131 


0.072 


M5 


1.190 


5.070 


2.090 


0.132 


0.430 


0.254 


0.591 


0.870 


0.424 


M6 


0.133 


0.094 


0.037 


0.014 


0.036 


0.024 


0.056 


0.040 


0.024 


M7 


0.079 


0.206 


0.085 


0.021 


0.036 


0.035 


0.067 


0.050 


0.034 


M8 


0.859 


2.120 


1.770 


2.010 


0.768 


0.588 


0.683 


0.547 


0.531 


M9 


0.594 


2.020 


0.896 


1.070 


0.443 


0.393 


0.428 


0.488 


0.377 


M10 


3.680 


6.380 


5.690 


5.640 


2.260 


2.740 


3.790 


2.800 


3.420 


Mil 


4.700 


6.350 


6.120 


5.880 


2.760 


4.430 


4.900 


4.160 


4.000 



Table 4: n = 1000, M = 300, 100 x MISE 



1000 


Histogram 


KdeNrdO 


KdcUCV 


BoostKde 


AggPure 


Stacking 


StackHist 


BagHist 


AggregHist 


Ml 


0.207 


0.066 


0.081 


0.136 


0.039 


0.032 


0.087 


0.064 


0.036 


M2 


0.374 


5.690 


2.500 


6.090 


0.896 


0.850 


0.365 


0.339 


0.676 


M3 


0.011 


0.002 


0.002 


0.014 


0.009 


0.006 


0.005 


0.008 


0.002 


M4 


0.099 


0.071 


0.076 


0.816 


0.036 


0.028 


0.074 


0.062 


0.035 


M5 


0.593 


4.530 


1.730 


0.062 


0.258 


0.146 


0.318 


0.420 


0.216 


M6 


0.064 


0.078 


0.030 


0.008 


0.018 


0.013 


0.032 


0.023 


0.024 


Ml 


0.054 


0.181 


0.066 


0.012 


0.021 


0.020 


0.042 


0.031 


0.031 


M8 


0.590 


2.070 


1.690 


1.530 


0.498 


0.381 


0.506 


0.336 


0.506 


M9 


0.406 


1.890 


0.767 


0.902 


0.307 


0.286 


0.331 


0.332 


0.342 


M10 


3.830 


6.220 


5.170 


5.360 


1.600 


2.450 


3.880 


2.980 


3.490 


Mil 


4.700 


6.220 


5.590 


5.340 


1.950 


4.100 


4.790 


4.380 


4.070 



5 Conclusion 

In this work we present three new algorithms for density estimation aggregating his- 
tograms. Two of them aggregate histograms over bootstrap samples of the data or ran- 
domly perturbed breakpoints. The third is a simple adaptation of the stacking algorithm 
where histograms are used instead of kernel density estimators. We have shown using ex- 
tensive simulations that these algorithms and the other ensembles techniques have better 
accuracy than the histogram or KDE. The first two algorithms BagHist and AggregHist 
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are very simple to implement, depend on very few parameters, and their computation 
complexity is proportional to that of a histogram. Theoretical properties of these algo- 
rithms are under study. Most of the algorithms described in this work may be easily 
generalized to the multivariate case. 
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